Problem Statement

—- INSERT ———–

Main Objectives

—- INSERT ———-

Key Resources

—– INSERT ——–

High-Level Process

—– INSERT ———-

Part 1

Importing

##                  ID          Year_Birth           Education      Marital_Status 
##                   0                   0                   0                   0 
##              Income             Kidhome            Teenhome         Dt_Customer 
##                  24                   0                   0                   0 
##             Recency            MntWines           MntFruits     MntMeatProducts 
##                   0                   0                   0                   0 
##     MntFishProducts    MntSweetProducts        MntGoldProds   NumDealsPurchases 
##                   0                   0                   0                   0 
##     NumWebPurchases NumCatalogPurchases   NumStorePurchases   NumWebVisitsMonth 
##                   0                   0                   0                   0 
##        AcceptedCmp3        AcceptedCmp4        AcceptedCmp5        AcceptedCmp1 
##                   0                   0                   0                   0 
##        AcceptedCmp2            Complain       Z_CostContact           Z_Revenue 
##                   0                   0                   0                   0 
##            Response             Country 
##                   0                   0
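
The table above reads like a per-column count of missing values, with `Income` as the only column that has any (24). A minimal sketch of how such a count can be produced in base R is shown below. The data frame `toy` and its values are made up for illustration, and dropping incomplete rows is only one possible way to handle the missing values, not necessarily the one used in this report.

```r
# Made-up stand-in for the imported data (hypothetical values)
toy <- data.frame(
  ID      = 1:5,
  Income  = c(50000, NA, 72000, NA, 26000),
  Kidhome = c(0, 1, 0, 1, 1)
)

# Count missing values per column
na_counts <- colSums(is.na(toy))
print(na_counts)

# One possible treatment: keep only the complete rows
complete <- na.omit(toy)
```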

Massaging

Part 1a - Regression - Predictive

Label: Web Purchases

Features: All numerical

Motivation

We use linear regression to identify the variables that best help us predict web purchases.

Method

  • Checklist for Regression
  • Split into training and test data
  • Scale data
  • Train model using train data
  • Use the fitted model to predict unseen values (test-data)
  • Plot predicted vs actual values
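
The steps above can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code behind the report: the data frame `toy`, its columns, the 80/20 split, and the seed are all assumptions made for the example.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# Simulated stand-in for the marketing data (hypothetical columns)
n <- 500
toy <- data.frame(
  visits = rpois(n, 5),
  income = rnorm(n, 50000, 15000)
)
toy$web_purchases <- 1 + 0.3 * toy$visits + 2e-5 * toy$income + rnorm(n)

# 1. Split into training (80%) and test (20%) data
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# 2. Scale the predictors with the training means and standard deviations
predictors <- c("visits", "income")
mu    <- colMeans(train[, predictors])
sigma <- apply(train[, predictors], 2, sd)
train[, predictors] <- scale(train[, predictors], center = mu, scale = sigma)
test[, predictors]  <- scale(test[, predictors],  center = mu, scale = sigma)

# 3. Train the model on the training data
fit <- lm(web_purchases ~ ., data = train)

# 4. Use the fitted model to predict the unseen test data
pred <- predict(fit, newdata = test)

# 5. Plot predicted vs. actual values
plot(test$web_purchases, pred,
     xlab = "Actual web purchases", ylab = "Predicted web purchases")
abline(0, 1, lty = 2)  # points on this line would be perfect predictions
```

The test set is scaled with the training statistics so that no information from the test data leaks into the fit.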

Mechanics

## 
## Call:
## lm(formula = NumWebPurchases ~ ., data = mark_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2724 -0.9603 -0.1315  0.8752 23.7417 
## 
## Coefficients: (2 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -7.451e+00  1.009e+01  -0.739  0.46024    
## ID                      5.050e-06  1.472e-05   0.343  0.73152    
## Year_Birth             -3.617e-03  4.480e-03  -0.807  0.41954    
## EducationBasic         -2.888e-01  3.449e-01  -0.837  0.40254    
## EducationGraduation     2.877e-02  1.702e-01   0.169  0.86582    
## EducationMaster         1.134e-03  2.002e-01   0.006  0.99548    
## EducationPhD            2.328e-01  1.966e-01   1.184  0.23663    
## Marital_StatusAlone     3.092e+00  1.838e+00   1.682  0.09273 .  
## Marital_StatusDivorced  1.769e+00  1.440e+00   1.229  0.21939    
## Marital_StatusMarried   1.804e+00  1.434e+00   1.258  0.20842    
## Marital_StatusSingle    1.771e+00  1.436e+00   1.234  0.21743    
## Marital_StatusTogether  1.905e+00  1.435e+00   1.328  0.18445    
## Marital_StatusWidow     1.710e+00  1.457e+00   1.174  0.24059    
## Marital_StatusYOLO      2.692e+00  2.012e+00   1.338  0.18111    
## Income                  1.228e-05  2.609e-06   4.706 2.72e-06 ***
## Kidhome                -7.247e-01  1.225e-01  -5.916 3.98e-09 ***
## Teenhome                3.309e-01  1.099e-01   3.011  0.00264 ** 
## Dt_Customer             7.387e-04  2.743e-04   2.693  0.00714 ** 
## Recency                 1.583e-03  1.693e-03   0.935  0.35008    
## MntWines                2.412e-03  2.549e-04   9.462  < 2e-16 ***
## MntFruits               1.655e-03  1.640e-03   1.009  0.31299    
## MntMeatProducts        -5.707e-04  3.729e-04  -1.530  0.12611    
## MntFishProducts         2.095e-03  1.259e-03   1.664  0.09629 .  
## MntSweetProducts        8.047e-03  1.552e-03   5.186 2.40e-07 ***
## MntGoldProds            8.670e-03  1.113e-03   7.788 1.17e-14 ***
## NumDealsPurchases       2.362e-01  3.232e-02   7.309 4.10e-13 ***
## NumCatalogPurchases    -1.054e-02  2.912e-02  -0.362  0.71732    
## NumStorePurchases       1.630e-01  2.220e-02   7.343 3.19e-13 ***
## NumWebVisitsMonth       3.256e-01  3.056e-02  10.654  < 2e-16 ***
## AcceptedCmp3           -4.418e-02  1.951e-01  -0.226  0.82088    
## AcceptedCmp4           -6.598e-02  2.185e-01  -0.302  0.76273    
## AcceptedCmp5           -3.351e-01  2.424e-01  -1.382  0.16706    
## AcceptedCmp1            1.211e-01  2.268e-01   0.534  0.59353    
## AcceptedCmp2           -1.836e+00  4.499e-01  -4.080 4.70e-05 ***
## Complain                4.763e-01  5.028e-01   0.947  0.34366    
## Z_CostContact                  NA         NA      NA       NA    
## Z_Revenue                      NA         NA      NA       NA    
## Response                5.097e-01  1.636e-01   3.115  0.00187 ** 
## CountryCA              -2.422e-02  2.302e-01  -0.105  0.91622    
## CountryGER             -9.974e-02  2.828e-01  -0.353  0.72439    
## CountryIND              1.967e-02  2.598e-01   0.076  0.93965    
## CountryME              -4.668e+00  2.000e+00  -2.335  0.01968 *  
## CountrySA              -1.202e-01  2.216e-01  -0.542  0.58755    
## CountrySP              -1.525e-01  1.977e-01  -0.771  0.44064    
## CountryUS              -9.565e-02  2.906e-01  -0.329  0.74205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.979 on 1729 degrees of freedom
## Multiple R-squared:  0.5058, Adjusted R-squared:  0.4938 
## F-statistic: 42.14 on 42 and 1729 DF,  p-value: < 2.2e-16

Message

(Analysis of output) The first regression we ran included all the variables available in the case. It was meant as a first screening of the statistical significance each variable brought to the model. Looking at the p-value in each row of the summary (we treated p < .05 as significant), we could identify the variables with the highest potential. Some variables we expected to have an impact on web purchases, such as the number of web visits per month or the number of store purchases. On the other hand, some variables that initially seemed unrelated to our goal turned out to be highly significant in the model (i.e., amount spent on sweet products, amount spent on gold products, and number of kids at home). To obtain a better regression, we then excluded the variables with low impact and kept only the ones with high potential, always basing our selection on p-values. We did not drop the variables indicating whether a customer accepted a campaign: although a few of them did not appear statistically significant, we thought it would not make sense to eliminate this group of variables.

The second regression we ran has a similar R^2 and adjusted R^2 to the first one, but this does not mean the model has not improved: we now reach the same fit with fewer variables. The F-statistic rose from 42 to 183, which is strong evidence that the retained variables are jointly significant, i.e., that there is a very low chance that all of their coefficients are 0. We had expected the fit to get stronger once we used fewer variables in the regression, but that was not the case for us. This could be due to the significance that the individual variables can have across our entire analysis.
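
The selection step and the comparison of the two fits can be sketched as follows. This is an illustration on simulated data under stated assumptions: the data frame `toy`, its columns, and the seed are hypothetical, and the 0.05 cutoff mirrors the threshold mentioned above. With factor predictors the coefficient names no longer match the column names, so the `reformulate` step would need adapting.

```r
set.seed(1)  # assumed seed for the sketch

# Simulated data: two informative predictors and two pure-noise predictors
n <- 300
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  noise1 = rnorm(n), noise2 = rnorm(n))
toy$y <- 2 + 1.5 * toy$x1 - 1 * toy$x2 + rnorm(n)

# First regression: all variables
full <- lm(y ~ ., data = toy)

# Keep the predictors whose p-value is below 0.05 (intercept excluded)
p_values <- summary(full)$coefficients[-1, "Pr(>|t|)"]
keep <- names(p_values)[p_values < 0.05]

# Second regression: selected variables only
reduced <- lm(reformulate(keep, response = "y"), data = toy)

# Compare adjusted R^2 and the overall F-statistic of both fits
compare <- data.frame(
  model  = c("full", "reduced"),
  adj_r2 = c(summary(full)$adj.r.squared, summary(reduced)$adj.r.squared),
  f_stat = c(summary(full)$fstatistic[["value"]],
             summary(reduced)$fstatistic[["value"]])
)
print(compare)
```

Dropping predictors that explain little tends to leave R^2 almost unchanged while raising the F-statistic, because the same explained variance is spread over fewer degrees of freedom.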

(Suggestions to CMO)

Looking at the coefficient estimates from our multiple (better) regression, we can select the variables with the largest positive or negative impact. For example, the number of kids at home is the variable with the most negative impact: the odds of business success decrease by 45% (exp(-6.06e-01) - 1) for each additional kid at home. On the other hand, the number of teens at home and the number of web visits per month have the highest positive influence in the regression. These variables bring a 43% (exp(3.562e-01) - 1) and a 42% (exp(3.530e-01) - 1) increase, respectively, in the odds of business success for each unit increase. Last, the variable with the least impact in the model is Income, as its estimate is the closest to zero: it brings just a 0.0032% (exp(3.175e-05) - 1) positive influence per unit of income. Our recommendation is to focus on the variables with the strongest positive impact and to avoid, as much as possible, the ones lowering the odds of business success. For example, targeting people with kids at home seems to have a negative effect, whereas it could be worthwhile to target people with teenagers at home, probably because teens are more willing than kids to use online sources. Moreover, the CMO should focus on increasing the number of web visits, which has a 42% (exp(3.530e-01) - 1) positive impact. This could be done by allocating some budget to online ads or to anything else that brings visibility to the website. As for budget allocation, it is also crucial to identify the variables with a low influence in the model. For example, our team thinks Income is not a critical aspect to focus on: the company could save resources by not targeting customers on this basis and instead invest in segments with better potential.
In fact, besides Income, there are a few more variables that matter more but whose effect is still not large enough to justify an effort. The estimates for the amounts spent on wines, gold, and sweet products in the past two years, for instance, are close to zero and, despite their statistical significance, could also be left out of budget allocation decisions. In conclusion, even though the better regression includes just the most significant variables, from a statistical point of view only a few of them meaningfully change the odds of business success, and thus the outcome of the business reasoning.
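
The percentages quoted above can be reproduced with a few lines of R. The coefficient values are the ones cited in the text, and the vector name `coefs` is only illustrative. As a caveat on interpretation: exp(b) - 1 as a change in odds is the usual reading for logistic-regression coefficients, whereas for a model fitted with `lm` the estimate itself is the change in the expected number of web purchases per unit increase in the predictor. Both readings are printed so they can be compared.

```r
# Coefficient estimates quoted in the text for the reduced model
coefs <- c(Kidhome = -6.06e-01, Teenhome = 3.562e-01,
           NumWebVisitsMonth = 3.530e-01, Income = 3.175e-05)

# Percentage change implied by exp(b) - 1, as used in the text
pct <- 100 * (exp(coefs) - 1)
print(round(pct, 4))

# Direct reading for a linear model: change in expected web purchases per unit
print(coefs)
```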

We should also use the “Income” variable to identify the different income classes within the data set. We should break this variable into three tiers (low-income, middle-income, and high-income) to obtain more specific information for targeting campaigns at each income class. As two of the campaigns did very well and were closely related to in-store purchases, we need to look into the specifics of why these campaigns were successful. Conversely, we need to investigate why the other campaigns failed to generate purchases and avoid repeating those tactics, so as not to waste budget and marketing resources.
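
One way to build the three income tiers is sketched below. It assumes the tiers are defined by tertiles of the observed incomes, which is only one possible choice, since fixed currency thresholds would work equally well. The income values are simulated, with a couple of missing entries to mimic the missing values in the real column.

```r
set.seed(7)  # assumed seed for the sketch

# Simulated incomes (hypothetical), including two missing values
income <- c(round(rlnorm(200, meanlog = 10.8, sdlog = 0.4)), NA, NA)

# Tertile cut points, ignoring missing values
breaks <- quantile(income, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)

# Assign each customer to a tier; missing incomes stay NA
tier <- cut(income, breaks = breaks,
            labels = c("low-income", "middle-income", "high-income"),
            include.lowest = TRUE)

print(table(tier, useNA = "ifany"))
```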

We also highly advise personalizing communications and deals around the characteristics and preferences of individual customers. More specific marketing should be based on customer demographics (education level, marital status, income level, and age, for example) to take a more pinpointed approach to advertising. This could be supported by a more in-depth analysis of why one campaign was more successful than another, based on who saw it and made a purchase. We should also investigate A/B testing as a way to increase deal purchases. Using resources more effectively through data analysis leads to more efficient marketing tactics and a higher return on investment for marketing budgets.

Better Regression

Label: Web Purchases

Features: Numerical variables selected based on findings in Part 1a

## 
## Call:
## lm(formula = NumWebPurchases ~ ., data = mark_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2724 -0.9603 -0.1315  0.8752 23.7417 
## 
## Coefficients: (2 not defined because of singularities)
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -7.451e+00  1.009e+01  -0.739  0.46024    
## ID                      5.050e-06  1.472e-05   0.343  0.73152    
## Year_Birth             -3.617e-03  4.480e-03  -0.807  0.41954    
## EducationBasic         -2.888e-01  3.449e-01  -0.837  0.40254    
## EducationGraduation     2.877e-02  1.702e-01   0.169  0.86582    
## EducationMaster         1.134e-03  2.002e-01   0.006  0.99548    
## EducationPhD            2.328e-01  1.966e-01   1.184  0.23663    
## Marital_StatusAlone     3.092e+00  1.838e+00   1.682  0.09273 .  
## Marital_StatusDivorced  1.769e+00  1.440e+00   1.229  0.21939    
## Marital_StatusMarried   1.804e+00  1.434e+00   1.258  0.20842    
## Marital_StatusSingle    1.771e+00  1.436e+00   1.234  0.21743    
## Marital_StatusTogether  1.905e+00  1.435e+00   1.328  0.18445    
## Marital_StatusWidow     1.710e+00  1.457e+00   1.174  0.24059    
## Marital_StatusYOLO      2.692e+00  2.012e+00   1.338  0.18111    
## Income                  1.228e-05  2.609e-06   4.706 2.72e-06 ***
## Kidhome                -7.247e-01  1.225e-01  -5.916 3.98e-09 ***
## Teenhome                3.309e-01  1.099e-01   3.011  0.00264 ** 
## Dt_Customer             7.387e-04  2.743e-04   2.693  0.00714 ** 
## Recency                 1.583e-03  1.693e-03   0.935  0.35008    
## MntWines                2.412e-03  2.549e-04   9.462  < 2e-16 ***
## MntFruits               1.655e-03  1.640e-03   1.009  0.31299    
## MntMeatProducts        -5.707e-04  3.729e-04  -1.530  0.12611    
## MntFishProducts         2.095e-03  1.259e-03   1.664  0.09629 .  
## MntSweetProducts        8.047e-03  1.552e-03   5.186 2.40e-07 ***
## MntGoldProds            8.670e-03  1.113e-03   7.788 1.17e-14 ***
## NumDealsPurchases       2.362e-01  3.232e-02   7.309 4.10e-13 ***
## NumCatalogPurchases    -1.054e-02  2.912e-02  -0.362  0.71732    
## NumStorePurchases       1.630e-01  2.220e-02   7.343 3.19e-13 ***
## NumWebVisitsMonth       3.256e-01  3.056e-02  10.654  < 2e-16 ***
## AcceptedCmp3           -4.418e-02  1.951e-01  -0.226  0.82088    
## AcceptedCmp4           -6.598e-02  2.185e-01  -0.302  0.76273    
## AcceptedCmp5           -3.351e-01  2.424e-01  -1.382  0.16706    
## AcceptedCmp1            1.211e-01  2.268e-01   0.534  0.59353    
## AcceptedCmp2           -1.836e+00  4.499e-01  -4.080 4.70e-05 ***
## Complain                4.763e-01  5.028e-01   0.947  0.34366    
## Z_CostContact                  NA         NA      NA       NA    
## Z_Revenue                      NA         NA      NA       NA    
## Response                5.097e-01  1.636e-01   3.115  0.00187 ** 
## CountryCA              -2.422e-02  2.302e-01  -0.105  0.91622    
## CountryGER             -9.974e-02  2.828e-01  -0.353  0.72439    
## CountryIND              1.967e-02  2.598e-01   0.076  0.93965    
## CountryME              -4.668e+00  2.000e+00  -2.335  0.01968 *  
## CountrySA              -1.202e-01  2.216e-01  -0.542  0.58755    
## CountrySP              -1.525e-01  1.977e-01  -0.771  0.44064    
## CountryUS              -9.565e-02  2.906e-01  -0.329  0.74205    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.979 on 1729 degrees of freedom
## Multiple R-squared:  0.5058, Adjusted R-squared:  0.4938 
## F-statistic: 42.14 on 42 and 1729 DF,  p-value: < 2.2e-16

Part 1b - Descriptive - US vs rest of world

Motivation

We want to find out whether there is a significant difference in purchasing behavior between customers in the U.S. and customers in the rest of the world. This is important for the marketing department because it gives an overview of which region purchases more and of how to allocate budget and effort more effectively across different strategies.

Method

Our approach was to aggregate the purchase columns (store purchases, catalog purchases, purchases made with deals, and web purchases) into a single total per customer. We then created a dummy variable that equals 1 if the customer's country is the US and 0 otherwise. We compared the two groups using this country dummy as the explanatory variable and total purchases as the response. To assess statistical significance, we calculated the 95% confidence interval for the difference in means.

H0: mean total purchases (U.S.) <= mean total purchases (rest of the world)

Ha: mean total purchases (U.S.) > mean total purchases (rest of the world)

Mechanics

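The 95% confidence interval for the true difference in population means is between -2.9700892 and 0.0094266. Because the interval contains 0, we cannot conclude at the 5% level that mean total purchases differ between the two groups.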
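
The method described above can be sketched in base R as follows. The data are simulated, so the resulting interval will not match the one reported. The column names follow the data set, `Total_Purch` follows the plotting code, and `is_us` is a hypothetical name for the dummy variable. The sketch uses Welch's two-sample t-test and takes the difference as US minus rest of world, both of which are assumptions about how the interval was computed.

```r
set.seed(123)  # assumed seed for the sketch

# Simulated customers (hypothetical values)
customers <- data.frame(
  Country             = sample(c("US", "SP", "CA", "GER"), 400, replace = TRUE),
  NumStorePurchases   = rpois(400, 6),
  NumCatalogPurchases = rpois(400, 3),
  NumDealsPurchases   = rpois(400, 2),
  NumWebPurchases     = rpois(400, 4)
)

# Aggregate the four purchase columns into one total per customer
purchase_cols <- c("NumStorePurchases", "NumCatalogPurchases",
                   "NumDealsPurchases", "NumWebPurchases")
customers$Total_Purch <- rowSums(customers[, purchase_cols])

# Dummy variable: 1 for US customers, 0 for everyone else
customers$is_us <- as.integer(customers$Country == "US")

us    <- customers$Total_Purch[customers$is_us == 1]
world <- customers$Total_Purch[customers$is_us == 0]

# 95% confidence interval for the difference in means (US minus rest of world)
ci <- t.test(us, world, conf.level = 0.95)$conf.int
print(ci)

# One-sided test matching Ha: U.S. > rest of the world
one_sided <- t.test(us, world, alternative = "greater")
print(one_sided$p.value)
```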

```r
library(plotly)

# Box plots of total purchases per customer: US vs. rest of the world
fig <- plot_ly(y = us_customer$Total_Purch, type = "box", name = "US") %>%
  layout(
    title = "Average Purchases per Country",
    plot_bgcolor = "#e5ecf6",
    xaxis = list(title = "Country"),
    yaxis = list(title = "Average Purchases")
  ) %>%
  add_trace(y = world_customer$Total_Purch, name = "Rest of world")
fig
```